Performance Analysis¶
We'll start by analyzing our best-performing model after initial tuning. Over ~100-200 runs we selected:
- the best-performing scheduler {explain which}
- the optimal set of transformations/augmentations {explain which}

We arrived at this configuration by maximizing only the high-level model performance metrics:
- total weighted loss (combining the normalized gender and age prediction losses)
- gender prediction accuracy
- MAE for age predictions
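As a sketch, the total weighted loss could be combined as below. The normalization scales and the 0.5/0.5 task weights are illustrative assumptions, not the exact values used during tuning:

```python
def combined_loss(gender_loss: float, age_loss: float,
                  gender_scale: float, age_scale: float,
                  w_gender: float = 0.5, w_age: float = 0.5) -> float:
    """Total weighted loss: each task loss is divided by a normalization
    scale so that neither task dominates, then the normalized losses are
    combined with tunable weights (the weights here are illustrative)."""
    return w_gender * (gender_loss / gender_scale) + w_age * (age_loss / age_scale)

# When both tasks sit exactly at their normalization scales the total is 1.0,
# which makes losses with very different raw magnitudes comparable.
total = combined_loss(gender_loss=0.2, age_loss=5.0, gender_scale=0.2, age_scale=5.0)
```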
Hello, world!
Gender Prediction Performance¶
| Metric | Female | Male | Overall |
|---|---|---|---|
| Support | 2353 | 2387 | 4740 |
| Accuracy | 0.931013 | 0.931013 | 0.931013 |
| Precision | 0.924204 | 0.937925 | 0.931065 |
| Recall | 0.937952 | 0.924173 | 0.931062 |
| F1-score | 0.931027 | 0.930998 | 0.931013 |
| AUC-ROC | NaN | NaN | 0.980522 |
| PR-AUC | NaN | NaN | 0.977997 |
| Log Loss | NaN | NaN | 0.178862 |
| Brier Score | NaN | NaN | NaN |
{TODO: add fancy confusion matrix}
Age Predictions¶
| Metric | Value |
|---|---|
| MAE | 5.105901 |
| MSE | 54.144762 |
| RMSE | 7.358312 |
| R-squared | 0.862191 |
| MAPE | 25.161557 |
| Statistic | True Age | Predicted Age |
|---|---|---|
| Mean | 33.308439 | 32.147823 |
| Median | 29.000000 | 28.514690 |
| Min | 1.000000 | -2.139822 |
| Max | 116.000000 | 95.214233 |
| Age_Group | Total | Correct | Accuracy |
|---|---|---|---|
| 0-4 | 444 | 307 | 0.6914 |
| 4-14 | 261 | 215 | 0.8238 |
| 14-24 | 636 | 604 | 0.9497 |
| 24-30 | 1228 | 1187 | 0.9666 |
| 30-40 | 865 | 837 | 0.9676 |
| 40-50 | 399 | 393 | 0.9850 |
| 50-60 | 420 | 409 | 0.9738 |
| 60-70 | 229 | 218 | 0.9520 |
| 70-80 | 156 | 149 | 0.9551 |
| 80+ | 102 | 94 | 0.9216 |
| Age_Group | Total | Correct | Accuracy |
|---|---|---|---|
| 0-4 | 444 | 306 | 0.6892 |
| 4-14 | 261 | 221 | 0.8467 |
| 14-24 | 636 | 615 | 0.9670 |
| 24-30 | 1228 | 1188 | 0.9674 |
| 30-40 | 865 | 846 | 0.9780 |
| 40-50 | 399 | 394 | 0.9875 |
| 50-60 | 420 | 411 | 0.9786 |
| 60-70 | 229 | 223 | 0.9738 |
| 70-80 | 156 | 149 | 0.9551 |
| 80+ | 102 | 96 | 0.9412 |
We can see that gender prediction accuracy is very high across all age ranges except young children. Realistically, there is little we can do about that: the facial features of babies are very different from those of adults. It might be worth investigating a separate model for them, but it's unlikely that it would achieve very high performance either.
| Age_Group | Support | Age_MAE | Age_MSE | Age_RMSE | Age_R-squared | Age_MAPE |
|---|---|---|---|---|---|---|
| 0-4 | 444 | 1.588580 | 11.325658 | 3.365361 | -9.241579 | 99.745904 |
| 4-14 | 261 | 4.011655 | 34.033093 | 5.833789 | -3.743251 | 46.700869 |
| 14-24 | 636 | 4.171022 | 32.965802 | 5.741585 | -2.937213 | 21.156784 |
| 24-30 | 1228 | 3.720786 | 30.006521 | 5.477821 | -10.167695 | 13.674633 |
| 30-40 | 865 | 6.270144 | 63.924114 | 7.995256 | -7.162335 | 17.644973 |
| 40-50 | 399 | 7.749943 | 96.742555 | 9.835779 | -10.194667 | 16.942367 |
| 50-60 | 420 | 7.311122 | 91.486462 | 9.564856 | -11.248783 | 13.271226 |
| 60-70 | 229 | 6.725516 | 80.393407 | 8.966237 | -8.236708 | 10.369088 |
| 70-80 | 156 | 7.617475 | 105.892985 | 10.290432 | -11.530508 | 10.082188 |
| 80+ | 102 | 8.947648 | 173.258202 | 13.162758 | -3.118748 | 9.777900 |
| Age_Group | Support | Age_MAE | Age_MSE | Age_RMSE | Age_R-squared | Age_MAPE |
|---|---|---|---|---|---|---|
| 0-4 | 444 | 1.014360 | 10.863007 | 3.295908 | -8.823212 | 57.902984 |
| 4-14 | 261 | 3.195415 | 26.828285 | 5.179603 | -2.739105 | 39.937090 |
| 14-24 | 636 | 3.587664 | 25.381818 | 5.038037 | -2.031433 | 17.745698 |
| 24-30 | 1228 | 4.186014 | 39.485719 | 6.283766 | -13.695620 | 15.399167 |
| 30-40 | 865 | 6.002176 | 59.482523 | 7.712491 | -6.595198 | 16.917257 |
| 40-50 | 399 | 6.352205 | 64.919602 | 8.057270 | -6.512241 | 13.902723 |
| 50-60 | 420 | 6.273703 | 72.882176 | 8.537106 | -8.757924 | 11.377895 |
| 60-70 | 229 | 6.505069 | 69.960602 | 8.364245 | -7.038043 | 10.017467 |
| 70-80 | 156 | 6.595112 | 74.531350 | 8.633154 | -7.819429 | 8.710363 |
| 80+ | 102 | 8.218197 | 167.143217 | 12.928388 | -2.973381 | 8.948656 |
LIME¶
Solving Age Balancing¶
Figure size: 840x2240 px
['dataset/test_2_folds_last/111_1_0_20170120134646399.jpg.chip.jpg', 'dataset/test_2_folds_last/1_1_0_20170109194452834.jpg.chip.jpg', 'dataset/test_2_folds_last/9_0_0_20170110225030430.jpg.chip.jpg', 'dataset/test_2_folds_last/8_0_1_20170114025855492.jpg.chip.jpg', 'dataset/test_2_folds_last/41_1_1_20170117021604893.jpg.chip.jpg']
Most Misclassified Images (both gender/age)¶
Figure size: 840x1400 px
Figure size: 840x1400 px
Misclassified Gender¶
Looking at gender specifically, it's likely that our model performs better than the summarized results imply.
The images above showcase where our model was least accurate, and all except one appear to be cases of mislabeled data in the original dataset (or they are labeled accurately according to those individuals' self-identity).
Figure size: 840x1960 px
We can see three main issues:
1. Some images are of poor quality or strongly cropped. We may be able to solve this by using heuristics in preprocessing to exclude such samples from the training and test sets.
2. There are patterns related to race and age. The model has trouble classifying faces of non-white people, possibly due to different facial features or skin color (although the grayscale transform should partially mitigate the latter). It also struggles with very old people and with children/babies, possibly because of small sample sizes and the relatively more "androgynous" facial features in those groups. We'll attempt to fix this with augmentation combined with oversampling: we'll use transforms to create additional samples for underrepresented age bins, and we'll draw on the color analysis from the EDA to also oversample images of underrepresented skin colors.
3. Many samples are potentially mislabeled. Some may show people who self-identify as male/female while retaining facial features, hairstyles, etc. typically associated with the opposite gender; others may simply be mislabeled. Either way, this is the hardest issue to solve.
Filtering Out "Invalid" Samples¶
We'll use a mix of metrics to determine which images are of very poor quality, lack enough detail for proper classification, etc.:
BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator):
A no-reference image quality assessment method. Uses scene statistics of locally normalized luminance coefficients to quantify possible losses of "naturalness" in the image due to distortions. Operates in the spatial domain.
Laplacian Variance:
A measure of image sharpness/blurriness. Uses the Laplacian operator to compute the second derivative of the image. Measures the variance of the Laplacian-filtered image.
FFT-based Blur Detection:
Uses Fast Fourier Transform to analyze the frequency components of an image. Applies a high-pass filter in the frequency domain and measures the remaining energy.
See the Data Analysis notebook for more details.
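For reference, the two blur measures can be sketched in NumPy alone. This is a minimal illustration, not the exact implementation: the 3x3 Laplacian is applied with wrap-around shifts, and the FFT cutoff window size is an arbitrary choice.

```python
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Sharpness score: variance of the Laplacian-filtered image.
    Higher values indicate sharper images; low values suggest blur."""
    # 3x3 Laplacian kernel applied via shifted sums (wrap-around at edges)
    lap = (-4 * img
           + np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0)
           + np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1))
    return float(lap.var())

def fft_blur_score(img: np.ndarray, cutoff: int = 8) -> float:
    """Blur score via the frequency domain: zero out a low-frequency window
    of +/- `cutoff` around the spectrum's center (a crude high-pass filter)
    and measure the remaining energy. Blurry images score lower."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    f[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0  # drop low freqs
    return float(np.mean(np.log1p(np.abs(f))))
```

Both scores drop when an image is smoothed, which is what the filtering step below relies on.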
BRISQUE + Laplacian Variance¶
One obvious major shortcoming of this approach is that we're excluding a significant proportion of samples simply because our model performs very poorly on them.
While {TODO}
A production pipeline might be:
- Check whether an image is valid using heuristics (e.g. prompting the user to reposition the camera)
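A minimal sketch of such a validity gate, assuming quality scores have already been computed per image; the threshold values and field names here are hypothetical and would be tuned on the score distributions from the Data Analysis notebook:

```python
# Hypothetical cut-offs, for illustration only.
BRISQUE_MAX = 60.0     # higher BRISQUE score = worse perceived quality
LAPLACIAN_MIN = 15.0   # lower Laplacian variance = blurrier image

def is_valid_sample(scores: dict) -> bool:
    """Keep a sample only if it passes every quality heuristic."""
    return (scores["brisque"] <= BRISQUE_MAX
            and scores["laplacian_var"] >= LAPLACIAN_MIN)

samples = [
    {"path": "a.jpg", "brisque": 35.2, "laplacian_var": 120.5},  # sharp, natural
    {"path": "b.jpg", "brisque": 78.9, "laplacian_var": 4.1},    # distorted, blurry
]
kept = [s for s in samples if is_valid_sample(s)]  # only "a.jpg" survives
```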
Augmentation Based Oversampling¶
We'll use augmentation/transforms combined with oversampling to increase the number of samples in underrepresented classes. This approach:
- allows us to preserve original data characteristics while introducing variability
Potential issues:
- Risk of overfitting to augmented versions of underrepresented samples
- Possibility of introducing unintended biases if augmentation isn't carefully balanced
- May not fully address underlying dataset biases
- Requires careful monitoring to ensure improved performance across all age groups
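The oversampling bookkeeping can be sketched as follows. The bin counts are illustrative, and "match the largest bin" is one possible target; a softer target may reduce the overfitting risk noted above:

```python
def oversample_plan(bin_counts: dict[str, int]) -> dict[str, int]:
    """Number of *additional* augmented samples to generate per age bin so
    that every bin reaches the size of the largest one. The extra copies
    would be produced by random transforms of the bin's original images."""
    target = max(bin_counts.values())
    return {name: target - count for name, count in bin_counts.items()}

# Illustrative bin sizes (not the real training-split counts):
plan = oversample_plan({"0-4": 444, "24-30": 1228, "80+": 102})
# plan == {"0-4": 784, "24-30": 0, "80+": 1126}
```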
Comparing Both Models¶
Let's look at samples that were misclassified by the initial model but are now correct in the new model:
| image_path | true_age | age_pred_base | age_error_base | age_pred_improved | age_error_improved | error_reduction |
|---|---|---|---|---|---|---|
| 80_1_0_20170110131953974.jpg.chip.jpg | 80 | 30.372637 | 49.627363 | 68.726158 | 11.273842 | 38.353521 |
| 46_1_3_20170120140919993.jpg.chip.jpg | 46 | 12.257863 | 33.742137 | 46.783295 | 0.783295 | 32.958842 |
| 55_0_0_20170117204213768.jpg.chip.jpg | 55 | 17.566719 | 37.433281 | 44.354115 | 10.645885 | 26.787395 |
| 34_1_2_20170108224608753.jpg.chip.jpg | 34 | 6.370270 | 27.629730 | 33.015022 | 0.984978 | 26.644752 |
| 38_1_0_20170117154129371.jpg.chip.jpg | 38 | 14.315865 | 23.684135 | 37.683617 | 0.316383 | 23.367752 |
Figure size: 840x1120 px
Figure size: 840x1680 px
Of course, we have specifically selected the best-case examples (i.e. those where the model's performance improved the most), which probably gives a much too optimistic picture of the overall improvement (the overall gains in accuracy/MAE are not as significant).
Here, instead, we've selected some of the samples our initial model failed on that were unlikely to be mislabeled: